Use cooperative groups to populate Associations (Histograms) in Pixel Patatrack #35713
base: master
Conversation
namespace cms {
  namespace cuda {

    template <template <CountOrFill> typename Func, typename Histo, typename... Args>
This is not used (yet?). It may make the syntax more complex, not simpler.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35713/26023
A new Pull Request was created by @VinInn (Vincenzo Innocente) for master. It involves the following packages:
@jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please review it and eventually sign? Thanks. cms-bot commands are listed here
auto kernel = fillManyFromVectorCoopKernel<Histo, T>;
auto nblocks = (totSize + nthreads - 1) / nthreads;
assert(nblocks > 0);
auto nOnes = view.size();
OK, a huge stack of boilerplate. It could be partially encapsulated in a "launch" interface as in launch.h.
If you want to give it a try, there is launch_cooperative(...) in launch.h. I don't think I've ever tested it, though.
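For reference, a minimal sketch of the raw cooperative-launch boilerplate that such a helper would hide. The wrapper name, the View type and the kernel's argument list are placeholders, not this PR's code; only fillManyFromVectorCoopKernel and the block-count arithmetic come from the diff above.

#include <algorithm>
#include <cuda_runtime.h>

// Minimal sketch, not the PR's actual code: a cooperative launch packs its
// arguments into an array of pointers and goes through
// cudaLaunchCooperativeKernel instead of the usual triple-chevron syntax.
template <typename Histo, typename T, typename View>
void launchFillManyCoop(Histo* histo, View view, int totSize, int nthreads,
                        int maxBlocks, cudaStream_t stream) {
  auto kernel = fillManyFromVectorCoopKernel<Histo, T>;  // kernel added by this PR
  // the grid must not exceed the number of blocks that can be co-resident
  int nblocks = std::min((totSize + nthreads - 1) / nthreads, maxBlocks);
  void* args[] = {&histo, &view, &totSize};  // assumed argument list
  cudaLaunchCooperativeKernel(reinterpret_cast<void const*>(kernel),
                              dim3(nblocks), dim3(nthreads), args, 0 /*shmem*/, stream);
}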
// pass-through fallback for __ldg, so the same code also compiles for the host
template <typename T>
inline T __ldg(T const* x) {
  return *x;
}

namespace cooperative_groups {
From @fwyzard's contribution to patatrack-alone.
#define GET_COOP_RED_FACT_FROM_ENV

// to drive performance assessment by envvar
#ifdef GET_COOP_RED_FACT_FROM_ENV
It makes life easy. Not supposed to be used in production.
#include <cstdlib>

template <typename F>
inline int maxCoopBlocks(F kernel, int nthreads, int shmem, int device, int redFact = 10) {
To be moved to CUDAService? It MUST be called at most once per job (per device? per kernel?).
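For concreteness, one possible body for maxCoopBlocks, sketched on top of the CUDA occupancy API; the PR's actual implementation may differ, and the treatment of redFact here is a guess.

#include <algorithm>
#include <cuda_runtime.h>

// Sketch only: the grid of a cooperative launch must fit into the blocks that
// can be co-resident on the device, so query the occupancy per SM and
// multiply by the number of SMs.
template <typename F>
inline int maxCoopBlocks(F kernel, int nthreads, int shmem, int device, int redFact = 10) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);  // relatively expensive: worth caching
  int blocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, nthreads, shmem);
  int maxBlocks = blocksPerSM * prop.multiProcessorCount;
  // optional "reduction factor" (discussed below); 0 would disable it
  return redFact > 0 ? std::max(1, maxBlocks / redFact) : maxBlocks;
}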
Does CUDA itself require that it be called at most once per job, or would calling it many times be slow (either by itself or because it causes synchronization)?
Either way, a good question. Probably the best place to cache the values would be in CUDAService. Can the number of threads per block and the size of shared memory per block vary between events (in general)?
I checked: major slowdown
This would go well with a more general development I've been thinking about for a while (and that Abdulla may work on, if he comes to CERN in January): making the launch configuration of each kernel configurable, with a common interface.
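Not part of this PR, but to make the idea concrete, a purely illustrative shape for such a common interface (no such CMSSW type exists here):

// Purely illustrative: a per-kernel launch configuration that a framework
// service could fill from the job configuration and cache per device.
struct KernelLaunchConfig {
  int threadsPerBlock = 128;
  int maxBlocks = 0;        // 0 = derive from the occupancy calculator
  int sharedMemBytes = 0;
  bool cooperative = false; // launch via cudaLaunchCooperativeKernel
};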
@@ -183,6 +184,52 @@ namespace cms {
      co[i] += psum[k];
    }
  }

  template <typename T>
  __device__ void coopBlockPrefixScan(T const* ici, T* ico, int32_t size, T* ipsum) {
This is really faster than the above (at least if all required blocks are available)
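For readers unfamiliar with the pattern, a simplified sketch of a grid-wide prefix scan with cooperative groups; the per-block step is serialized on thread 0 for brevity, whereas the real kernel would reuse the existing parallel block scan, and its exact interface may differ.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Simplified sketch, not the PR's kernel: grid-wide inclusive prefix scan.
template <typename T>
__device__ void coopBlockPrefixScanSketch(T const* ci, T* co, int32_t size, T* psum) {
  auto grid = cg::this_grid();
  int32_t chunk = (size + gridDim.x - 1) / gridDim.x;
  int32_t first = blockIdx.x * chunk;
  int32_t len = min(chunk, size - first);  // may be <= 0 for trailing blocks

  // 1) each block scans its own chunk and records its total in psum
  if (threadIdx.x == 0) {
    T sum = 0;
    for (int32_t i = 0; i < len; ++i) {
      sum += ci[first + i];
      co[first + i] = sum;
    }
    psum[blockIdx.x] = sum;
  }
  grid.sync();  // all per-block totals are now visible

  // 2) one thread turns the per-block totals into exclusive offsets
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    T running = 0;
    for (uint32_t b = 0; b < gridDim.x; ++b) {
      T s = psum[b];
      psum[b] = running;
      running += s;
    }
  }
  grid.sync();  // offsets are visible to every block

  // 3) each block shifts its chunk by the sum of all previous blocks
  for (int32_t i = threadIdx.x; i < len; i += blockDim.x)
    co[first + i] += psum[blockIdx.x];
}

The grid.sync() points replace what would otherwise be separate kernel launches, which is presumably where the single-thread speedup quoted in the PR description comes from.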
int maxBlocks = maxCoopBlocks(populate, nThreads, 0, 0, 0);
std::cout << "max number of blocks is " << maxBlocks << std::endl;
auto ncoopblocks = std::min(nBlocks, maxBlocks);
auto a1 = v_d.get();
One cannot take a pointer to the return value of .get(), hence the local variable.
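A tiny self-contained illustration of the constraint (std::unique_ptr stands in for the device unique_ptr used in the test):

#include <memory>

// Illustration only: the launch-argument array stores pointers to the
// arguments, so each argument must be an addressable lvalue; .get() returns
// a temporary whose address cannot be taken.
int main() {
  auto v_d = std::make_unique<int[]>(10);
  auto a1 = v_d.get();   // copy the raw pointer into a named local...
  void* args[] = {&a1};  // ...so its address can go into the args array
  // void* bad[] = {&v_d.get()};  // would not compile: address of an rvalue
  (void)args;
  return 0;
}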
auto env = getenv("COOP_RED_FACT");
int redFactFromEnv = env ? atoi(env) : 0;
if (redFactFromEnv != 0)
  redFact = redFactFromEnv;
This is a "global reduction factor" to reduce the number of blocks required to launch a cooperative kernel.
It may need to be tuned kernel by kernel: a bit of a mess...
I don't understand this: why a "reduction factor"?
If it needs to be tuned kernel by kernel, the effect is the same as setting a hard limit on the number of blocks.
Because I hope there will be no need to tune kernel by kernel to get reasonable performance for any kind of workflow and event size/type.
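To make the trade-off concrete, a worked example with hypothetical numbers (not measurements from this PR):

#include <algorithm>
#include <iostream>

// Hypothetical numbers, for illustration only.
int main() {
  int blocksPerSM = 16;               // resident blocks per SM for a given kernel/config
  int nSM = 80;                       // e.g. a V100-class GPU has 80 SMs
  int maxBlocks = blocksPerSM * nSM;  // 1280 blocks could be co-resident
  int redFact = 10;                   // the "global reduction factor"
  int launched = std::max(1, maxBlocks / redFact);  // only 128 blocks requested
  std::cout << "launch " << launched << " of " << maxBlocks << " possible blocks\n";
  return 0;
}

Requesting fewer blocks than the co-residency limit leaves SM resources for kernels from other concurrent streams, which is the multi-stream throughput concern mentioned in the PR description.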
By the way, on the T4 and V100, what is the maximum number of blocks reported by CUDA?
using View = caConstants::TupleMultiplicity::View;
View view = {tupleMultiplicity_d, nullptr, nullptr, -1, -1};

int blockSize = 128;
Duplicated boilerplate.
The effort of factorizing it may be wasted if one decides to get rid of the TupleMultiplicity container and filter on multiplicity in the fit routine...
@cmsbuild, please test
Milestone for this pull request has been moved to CMSSW_14_1_X. Please open a backport if it should also go in to CMSSW_14_0_X.
ping
Milestone for this pull request has been moved to CMSSW_14_2_X. Please open a backport if it should also go in to CMSSW_14_1_X.
ping (to make bot change milestone)
Milestone for this pull request has been moved to CMSSW_15_0_X. Please open a backport if it should also go in to CMSSW_14_2_X.
In this PR I wish to share code that uses cooperative groups to reduce the number of kernels used to populate "Histograms" (actually OneToMany Associations) in Patatrack.
In unit tests (single-threaded) the gain in speed is noticeable (even in just the prefix scan).
In standard multi-thread, multi-stream workflows a loss in throughput is easily observed if the maximum number of blocks is allocated. Some fine tuning of the number of blocks allocated to each kernel (even just one block?) makes this PR at least as fast as the standard multi-kernel implementation.
More comments inline.
The code is "configured" to run with cooperative groups; of course the actual PR can be merged with the standard multi-kernel implementation as the default.